Language modeling by string pattern n-gram for Japanese speech recognition
Abstract
This paper describes a new, powerful statistical language model, based on the N-gram model, for Japanese speech recognition. An English sentence is written word by word, whereas a Japanese sentence has no word-boundary character. A Japanese sentence therefore requires word segmentation by morphological analysis before a word N-gram can be constructed. We propose an N-gram-based language model that requires no word segmentation. The model uses character string patterns as the units of the N-gram; the string patterns are chosen from the training text according to a statistical criterion. We carried out several experiments comparing the perplexities of the proposed and conventional models, and the results showed the advantage of our model. Because it may interest many readers, we also applied the method to English text. In a preliminary experiment, the proposed method performed better than a conventional word trigram.
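The idea of using character string patterns as N-gram units can be illustrated with a minimal Python sketch. The abstract does not say which statistical criterion selects the patterns or how text is split into them, so the sketch makes two loudly labeled assumptions: patterns are kept when their raw frequency reaches a threshold, and text is split by greedy longest match. The function names `select_patterns` and `segment` and the parameters `max_len` and `min_count` are hypothetical and do not come from the paper.

```python
from collections import Counter


def select_patterns(text, max_len=3, min_count=2):
    """Collect multi-character strings seen at least min_count times.

    ASSUMPTION: a plain frequency threshold stands in for the paper's
    unspecified statistical criterion.
    """
    counts = Counter(
        text[i:i + n]
        for n in range(2, max_len + 1)
        for i in range(len(text) - n + 1)
    )
    return {s for s, c in counts.items() if c >= min_count}


def segment(text, patterns, max_len=3):
    """Split text into units without a word segmenter.

    ASSUMPTION: greedy longest match; a single character is the fallback
    unit when no selected pattern starts at the current position.
    """
    units, i = [], 0
    while i < len(text):
        for n in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + n]
            if n == 1 or piece in patterns:
                units.append(piece)
                i += n
                break
    return units


train = "東京都に行く東京都に住む"
patterns = select_patterns(train)
units = segment(train, patterns)
print(units)
# → ['東京都', 'に', '行', 'く', '東京都', 'に', '住', 'む']

# Bigram counts over the resulting units, the raw material of an N-gram model.
bigrams = Counter(zip(units, units[1:]))
print(bigrams[("東京都", "に")])
# → 2
```

In this toy run the frequent string 東京都 becomes a single unit with no morphological analysis, while rarer characters stay as single-character units. A real system would replace the frequency threshold with the paper's criterion and add smoothing before computing perplexity.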
Similar resources
Effects of word string language models on noisy broadcast news speech recognition
In this paper, we show that our n-gram-based word string language model, combined with speaker and noise adaptation of the acoustic model, improves recognition performance on noisy broadcast news speech. The focus was on remedying recognition errors in short words. The word string language models based on POS and n-gram frequency reduced deletion errors by 17%, i...
N-gram language modeling of Japanese using bunsetsu boundaries
A new scheme of N-gram language modeling was proposed for Japanese, where word N-grams were calculated separately for the two cases: crossing and not crossing bunsetsu boundaries. Here, bunsetsu is a basic grammatical (and pronunciation) unit of Japanese. A similar scheme using accent phrase boundaries instead of bunsetsu boundaries has already been proposed by the authors with a certain succes...
Mining of association patterns for language modeling
N-gram language modeling is popular in speech recognition and many other applications. The conventional n-gram suffers from insufficient training data, a lack of domain knowledge, and an inability to capture long-distance language dependencies. This paper presents a new approach to mining long-distance word associations and incorporating their mutual information into language models. We aim to discover the associ...
Modeling Pronunciation of OOV Words for Speech Recognition
This paper presents a technique for modeling pronunciation in automatic speech recognition using an approach based on statistical machine translation. The task of a pronunciation model in speech recognition is to convert a sequence of phonemes into proper words of the language. This task can be framed as machine translation, whereby the source language is a sequence of phonemes and...
Selection of Multi-Word Expressions from Web N-gram Corpus for Speech Recognition
This paper proposes a method for constructing a statistical language model with multi-word expressions (MWEs) selected from the Google Japanese Web N-gram. MWEs are concatenated words that consist of frequently used idiomatic expressions or long morpheme sequences. In this paper a method for selecting the effective MWEs that improve the language model based on co-occurrence probabilities of ...